The goal is to answer the following question / hypothesis based on the DataCamp dataset [Video Game Sales Data](https://app.datacamp.com/workspace/datasets/dataset-python-video-games-sales):
As an introduction, we will work through the following courses on DataCamp:
# Import libraries
# Note: plyr is loaded before dplyr on purpose, so that dplyr's verbs
# (filter, mutate, summarize, ...) mask the plyr versions
library("plotly")
library("ggplot2")
library("plyr")
library("dplyr")
library("broom")
# Read csv from folder "data"
df <- read.csv("./data/video_games_data.csv")
head(df, 10)
Before we can start with the visualizations and models, we have to clean the data. That means there should be no duplicates, missing values should be encoded correctly, and data that will not be used should be removed.
The columns "Year" and "Publisher" contain "N/A" strings. These should be proper `NA` values so that they are excluded from the visualizations and calculations.
# Show rows with "N/A" values
df[grep("N/A", df$Publisher),]
df[grep("N/A", df$Year),]
# Replace "N/A" with "NA"
df[df == "N/A"] <- NA
# Keep only titles with more than 0.1 million global sales
df <- df %>%
  filter(Global_Sales > 0.1)
df
# Check if values have been converted
df %>%
summarize(across(everything(), ~sum(is.na(.))))
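As a side note, the same result can be achieved in one step by declaring the "N/A" strings at read time (a sketch using the `na.strings` argument of `read.csv`):

```r
# Alternative: treat "N/A" as missing directly when reading the file,
# which makes the later string replacement unnecessary
df <- read.csv("./data/video_games_data.csv", na.strings = c("NA", "N/A"))
```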
Since 2017 contains only 3 entries and 2020 only 1, we will drop these years from the data frame: they are incomplete and could make our models inaccurate.
# Remove years 2017 and 2020 from dataset
# %in% is NA-safe: rows with a missing Year are kept instead of turning into NA rows
df_clean <- df[!(df$Year %in% c("2017", "2020")), ]
View(df_clean)
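As a quick sanity check, the per-year counts can be inspected to confirm that the sparse years are gone:

```r
# Entries per year after cleaning; 2017 and 2020 should no longer appear
table(df_clean$Year)
```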
# Columns such as Rank, Name and Platform are not needed to answer our question; note that Global_Sales is kept because the models below use it
# Set data to correct type
df_clean$Genre <- as.factor(df_clean$Genre)
df_clean$Year <- as.numeric(df_clean$Year)
na <- sum(df_clean[, 'NA_Sales'], na.rm = TRUE)
eu <- sum(df_clean[, 'EU_Sales'], na.rm = TRUE)
jp <- sum(df_clean[, 'JP_Sales'], na.rm = TRUE)
o <- sum(df_clean[, 'Other_Sales'], na.rm = TRUE)
g <- sum(df_clean[, 'Global_Sales'], na.rm = TRUE)
fig <- plot_ly(
y = c(na, eu, jp, o),
x = c("North America", "Europe", "Japan", "Other"),
type = 'bar',
width = 800
)
fig <- fig %>% layout(title = "Video Game Sales Overview",
xaxis = list(title = "Region"),
yaxis = list(title = "Sales (million)"))
fig
We can see that North America is the largest market.
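To quantify this, each region's share of global sales can be computed from the sums above (a small sketch reusing the variables `na`, `eu`, `jp`, `o` and `g`):

```r
# Share of global sales per region, rounded to three decimals
round(c(NorthAmerica = na, Europe = eu, Japan = jp, Other = o) / g, 3)
```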
# group by genre and summarize game sales to each region
df_genre <- df_clean %>%
group_by(Genre) %>%
summarize(
NA_Sales_Sum = sum(NA_Sales),
EU_Sales_Sum = sum(EU_Sales),
JP_Sales_Sum = sum(JP_Sales),
Other_Sales_Sum = sum(Other_Sales),
Global_Sales_Sum = sum(Global_Sales)
)
# Plot grouped bar chart video game sales by genre
fig <- plot_ly(
df_genre, y = ~Genre, x = ~NA_Sales_Sum, type = "bar", name = "North America", width = 1000, height = 800) %>%
add_trace(x = ~EU_Sales_Sum, name = "Europe") %>%
add_trace(x = ~JP_Sales_Sum, name = "Japan") %>%
add_trace(x = ~Other_Sales_Sum, name = "Other") %>%
layout(
title = "Video Game Sales by Genre",
xaxis = list(title = "Sales (million)"),
barmode = "group"
)
fig
Japan has far higher role-playing and strategy sales, and far lower shooter and action sales, than the other regions.
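This difference can also be expressed numerically, e.g. as Japan's share of global sales per genre (a sketch based on the `df_genre` summary above):

```r
# Japan's share of global sales per genre, largest first
df_genre %>%
  mutate(JP_Share = round(JP_Sales_Sum / Global_Sales_Sum, 3)) %>%
  arrange(desc(JP_Share)) %>%
  select(Genre, JP_Share)
```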
df_year <- df_clean %>%
group_by(Year) %>%
summarize(
NA_Sales_Sum = sum(NA_Sales),
EU_Sales_Sum = sum(EU_Sales),
JP_Sales_Sum = sum(JP_Sales),
Other_Sales_Sum = sum(Other_Sales),
Global_Sales_Sum = sum(Global_Sales)
)
fig <- plot_ly(
df_year, y = ~NA_Sales_Sum, x = ~Year, type = "bar", name = "North America", width = 900, height = 500) %>%
add_trace(y = ~EU_Sales_Sum, name = "Europe") %>%
add_trace(y = ~JP_Sales_Sum, name = "Japan") %>%
add_trace(y = ~Other_Sales_Sum, name = "Other") %>%
layout(
title = "Video Game Sales by Year",
xaxis = list(title = "Year"),
yaxis = list(title = "Sales (million)"),
barmode = "stack"
)
fig
Now that we understand our data better, we can start with the regression models and with answering our research question.
# Calculate the average of sales of each genre from each region
df_sales_avg <- df_clean %>%
group_by(Genre) %>%
summarise(
EU_Sales_Avg = mean(EU_Sales),
NA_Sales_Avg = mean(NA_Sales),
JP_Sales_Avg = mean(JP_Sales),
Other_Sales_Avg = mean(Other_Sales),
Global_Sales_Avg = mean(Global_Sales))
df_sales_avg
Can the European sales of a genre be predicted from its North American sales?
# Create a data frame containing only games of the selected genre (here: Racing)
genre <- "Racing"
dF_action <- df_clean %>%
filter(Genre == genre)
ggplot(dF_action, aes(x=EU_Sales, y=NA_Sales)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
labs(title = genre)
# Create linear model
mdl_action <- lm(EU_Sales ~ NA_Sales, data = dF_action)
# Extract model score
mdl_action %>%
glance() %>%
pull(r.squared)
[1] 0.7601522
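Before trusting the R-squared value, a quick residual check is useful; `broom::augment()` returns the fitted values and residuals of the model (a sketch):

```r
# Residuals vs. fitted values; points should scatter evenly around zero
mdl_action %>%
  augment() %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)
```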
# df_cleaner keeps only titles with Global_Sales > 0.8 to reduce overplotting
df_cleaner <- df_clean %>%
  filter(Global_Sales > 0.8)
# Color the scatterplot matrix (SPLOM) of NA_Sales, EU_Sales and JP_Sales by genre
df_cleaner %>%
  plot_ly(color = ~Genre) %>%
  add_trace(
    type = 'splom',
    dimensions = list(
      list(label = 'N. America', values = ~NA_Sales),
      list(label = 'Europe', values = ~EU_Sales),
      list(label = 'Japan', values = ~JP_Sales)
    )
  )
df_cleaner <- df_clean %>%
filter(Global_Sales > 0.8)
mdl_eu_global <- lm(EU_Sales ~ Global_Sales, data = df_cleaner)
mdl_eu_global
Call:
lm(formula = EU_Sales ~ Global_Sales, data = df_cleaner)
Coefficients:
(Intercept) Global_Sales
-0.004866 0.293020
mdl_eu_global_trans <- lm(sqrt(EU_Sales) ~ sqrt(Global_Sales), data = df_cleaner)
mdl_eu_global_trans
Call:
lm(formula = sqrt(EU_Sales) ~ sqrt(Global_Sales), data = df_cleaner)
Coefficients:
(Intercept) sqrt(Global_Sales)
-0.06338 0.54523
summary(mdl_eu_global)
Call:
lm(formula = EU_Sales ~ Global_Sales, data = df_cleaner)
Residuals:
Min 1Q Median 3Q Max
-8.2063 -0.2264 0.0066 0.1608 4.7804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.004866 0.013175 -0.369 0.712
Global_Sales 0.293020 0.003166 92.554 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5399 on 2524 degrees of freedom
Multiple R-squared: 0.7724, Adjusted R-squared: 0.7723
F-statistic: 8566 on 1 and 2524 DF, p-value: < 2.2e-16
# Residual standard error (sigma) of the transformed model
mdl_eu_global_trans %>%
glance() %>%
pull(sigma)
[1] 0.2751684
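The two model variants can also be compared side by side with `broom::glance()` (a sketch; note that `sigma` of the transformed model is on the square-root scale and is therefore not directly comparable):

```r
# Compare R-squared and residual standard error of both models
bind_rows(
  glance(mdl_eu_global) %>% mutate(model = "linear"),
  glance(mdl_eu_global_trans) %>% mutate(model = "sqrt")
) %>%
  select(model, r.squared, sigma)
```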
ggplot(df_sales_avg, aes(x=NA_Sales_Avg, y=Global_Sales_Avg)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
mdl_na_global <- lm(NA_Sales_Avg ~ Global_Sales_Avg, data = df_sales_avg)
mdl_na_global
Call:
lm(formula = NA_Sales_Avg ~ Global_Sales_Avg, data = df_sales_avg)
Coefficients:
(Intercept) Global_Sales_Avg
-0.07453 0.58093
na <- tibble(Global_Sales_Avg = 9)
predict(mdl_na_global, na)
1
5.153829
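`predict()` can also return an uncertainty interval around the point estimate (a sketch using the same hypothetical input tibble `na`):

```r
# 95% confidence interval for the predicted North American average
predict(mdl_na_global, na, interval = "confidence")
```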
ggplot(df_sales_avg, aes(x=JP_Sales_Avg, y=Global_Sales_Avg)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
mdl_jp_global <- lm(JP_Sales_Avg ~ Global_Sales_Avg, data = df_sales_avg)
mdl_jp_global
Call:
lm(formula = JP_Sales_Avg ~ Global_Sales_Avg, data = df_sales_avg)
Coefficients:
(Intercept) Global_Sales_Avg
0.05376 0.09446
jp <- tibble(Global_Sales_Avg = 9)
predict(mdl_jp_global, jp)
1
0.9038834
# Standardize
#df_clean[7 : 10] <- as.data.frame(scale(df_clean[7 : 10]))
#df_clean